We adapt here the tutorial with demo data (see file RNASeq_in_R_demo.html) to study a ‘real’ scenario.

Setup

The code used to generate this document along with the data that you will need to reproduce the results presented here are available from the github repository VoisinneG/RNA_seq_analysis_in_R

Clone this repository from the terminal with the command:

git clone "https://github.com/VoisinneG/RNA_seq_analysis_in_R"

First things first : create a R project to store data, code, output tables and figures in separate sub-folders. Just click on “File > New Project…”

Create a new project. Add directories ./data, ./R, ./output and ./figures. Create a new R script file script.R in the ./R directory. We will use it to copy paste useful command lines below. We can re-run all commands sequentially by clicking on the source button in the top right corner of the R studio editor.

Report

You will have to report your analysis in the same format as this document. This document has been built from a Rmarkdown file using the knitr package. You can download the original .Rmd file here which could provide a useful starting point. You can find more information about Rmarkdown on the official website and in this cheatsheet .

Data

Read count data and sample metadata files are available from the github repository you have just cloned in the data folder. You can copy these file in your ./data folder.

Load data

We can now start by loading the data into the RStudio environment.

seqdata <- read.delim("./data/S16082_allresV2_rawdata.csv", 
                      stringsAsFactors = FALSE)
sampleinfo <- read.delim("./data/Sample_Info.csv")

You now have two objects in your environment. You can have a look at them using the head() or View() commands or by double clicking on an object within the environment. The seqdata object contains information about genes (one gene per row), the first column has the Entrez gene id, the second has the gene length and the remaining columns contain information about the number of reads aligning to the gene in each experimental sample.

names(seqdata)
##  [1] "Ensembl.gene.id"   "X1_28_Q_NT"        "X2_57_Q_NT"       
##  [4] "X4_80_Q_NT"        "X5_82_AM_NT"       "X6_24_AM_NT"      
##  [7] "X7_91_AM_NT"       "X8_47_Q_TNF"       "X9_23_Q_TNF"      
## [10] "X10_99_Q_TNF"      "X11_100_Q_TNF"     "X12_54_AM_TNF"    
## [13] "X13_54_AM_TNF"     "X15_94_AM_TNF"     "X17_3_Q_IS"       
## [16] "X18_7_Q_IS"        "X19_83_Q_IS"       "X20_84_Q_IS"      
## [19] "X21_1_Q_5ASA"      "X22_1_Q_5ASA"      "X23_21_Q_5ASA"    
## [22] "X24_22_Q_5ASA"     "X26_49_AM_5ASA"    "X27_49_AM_5ASA"   
## [25] "X28_85_AM_5ASA"    "X29_85_AM_5ASA"    "X31_20_AM_5ASA"   
## [28] "X32_20_AM_5ASA"    "X35_68_AM_TNF_RCH" "X36_55_AM_TNF_RCH"
## [31] "X37_55_AM_TNF_RCH" "X38_90_AS_Cort"    "X39_93_AS_Cort"   
## [34] "X40_44_AS_Cort"    "X41_96_AS_Cort"    "X42_2"            
## [37] "X43_9"             "X44_18"            "X45_41"           
## [40] "X46_50"            "Gene_name"

The sampleinfo file contains basic information about the samples that we will need for the analysis today.

sampleinfo
##       sampleName Treatment    Status Replicate
## 1     X1_28_Q_NT        NT quiescent         1
## 2     X2_57_Q_NT        NT quiescent         2
## 3     X4_80_Q_NT        NT quiescent         4
## 4    X5_82_AM_NT        NT    active         1
## 5    X6_24_AM_NT        NT    active         2
## 6    X7_91_AM_NT        NT    active         3
## 7    X8_47_Q_TNF       TNF quiescent         1
## 8    X9_23_Q_TNF       TNF quiescent         2
## 9   X10_99_Q_TNF       TNF quiescent         3
## 10 X11_100_Q_TNF       TNF quiescent         4
## 11 X12_54_AM_TNF       TNF    active         1
## 12 X13_54_AM_TNF       TNF    active         2
## 13 X15_94_AM_TNF       TNF    active         4
## 14         X42_2        NT   healthy         1
## 15         X43_9        NT   healthy         2
## 16        X44_18        NT   healthy         3
## 17        X45_41        NT   healthy         4
## 18        X46_50        NT   healthy         5

Formatting

We will be manipulating and reformatting the counts matrix into a suitable format for downstream analysis. The first two columns in the seqdata dataframe contain annotation information. We need to make a new matrix countdata containing only the counts, but we can store the gene identifiers (the EntrezGeneID column) as rownames.

# Edit countdata columns
idx_remove <- which(! names(seqdata) %in% sampleinfo$sampleName)
countdata <- seqdata[,-idx_remove]
# Store Ensembl GeneID as rownames
rownames(countdata) <- seqdata[,1]

Note that the column names are now the same as SampleName in the sampleinfo file. This is good because it means our sample information in sampleinfo is in the same order as the columns in countdata. Let’s also simplify sampleinfo by putting SampleName as row names.

rownames(sampleinfo) <- sampleinfo[,1]
sampleinfo <- sampleinfo[, -1]
table(colnames(countdata)==rownames(sampleinfo))
## 
## TRUE 
##   18

Filtering

Genes with very low counts across all libraries provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis. Here we perform a minimal pre-filtering to keep only genes that have at least 1 read total.

keep <- rowMeans(countdata) >= 1
countdata <- countdata[keep, ]

If you want to normalize data and perform differential expression analysis, you can jump to section Creating a DESeqDataSet

Annotations

The only information we have about genes is their ENSEMBL Gene ID, which is not very informative. We would like to add some annotation information. There are a number of ways to do this. We will demonstrate how to do this using the org.Mm.eg.db package.

First we need to decide what information we want. In order to see what we can extract we can run the columns function on the annotation database.

library(org.Hs.eg.db)
columns(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "IPI"          "MAP"          "OMIM"        
## [16] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"       "UNIGENE"     
## [26] "UNIPROT"

Entries in this database are ENTREZID and not ENSEMBL IDs. Let’s get all information from all entries using the select function. We can now match ENTREZ IDs and ENSEMBL IDs. We will store our annotation information in a separate data frame :

annot <- AnnotationDbi::select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns=c("ENTREZID", "ENSEMBL", "SYMBOL"))
idx_match <- match(row.names(countdata), annot$ENSEMBL)
annot <- annot[idx_match, ]
rownames(annot) <- rownames(countdata)

Let’s check that the ENSEMBL column matches the countdata rownames.

table(annot$ENSEMBL==rownames(countdata))
## 
##  TRUE 
## 20271

You might have noticed that some symbols appear as NA. That is because there is not a one-to-one match between EntrezID and Ensembl gene IDs. Gene names are also in seqdata, so we’ll use these instead.

annot$Gene_name <- seqdata$Gene_name[match(rownames(annot), seqdata$Ensembl.gene.id)]

Visualization

We are working we raw count data in this section. Data normalization and transformation will come later.

Heatmap

To explore a count matrix, it is often instructive to look at it as a heatmap. Below we show how to produce such a heatmap. We focus on the 20 genes with the highest average counts.

library("pheatmap")

select <- order(rowMeans(countdata), decreasing=TRUE)[1:20]

pheatmap(log10(countdata[select,]+1),
         cluster_rows=FALSE,
         cluster_cols=FALSE)

Look up the documentation for pheatmap in Rstudio. The parameter annotation_col allow us to add annotation columns. We’ll also cluster rows and columns , scale values by rows and change row labels using gene symbol.

pheatmap(log10(countdata[select,]+1), 
         show_rownames=TRUE, 
         cluster_cols=TRUE, 
         annotation_col=sampleinfo[ , c("Treatment","Status")] ,
         scale = "row",
         labels_row = annot$Gene_name[select]
         )

If you like interactive objects, you could try the heatmaply() function to generate the heatmap.

library(heatmaply)

heatmaply(log10(countdata[select,]+1), scale = "row")

Data visualization with ggplot2

Data visualization with ggplot2 works with data frames. Here, we convert our data into a data frame where all count values are stored in the same column named value (long format). This will allow us to fully exploit ggplot2 features.

library(reshape2)
library(dplyr)
df <- countdata
df$GeneID <- rownames(countdata)
df_melt <- melt(df, id.vars = "GeneID")
df_melt <- rename(df_melt,  sample = variable)

As an example, we can now plot the distribution of log10+1 transformed count values for each sample (You can offset histograms for better readability using the ggridges package).

plot <- ggplot(df_melt, aes(x=log10(value+1), fill = sample)) + 
  geom_density(alpha = 0.5)
plot

Box plots allow to see a summary of these distributions.

plot <- ggplot(df_melt, aes(x=sample, y = log10(value+1), fill = sample)) + 
  theme(axis.text.x = element_text(angle = 90)) +
  geom_boxplot(alpha = 0.5)
plot

Normalisation

Some samples seem to differ quite neatly from the other samples. Some normalization between samples will be needed before going further into the analysis.

Here we use the dplyr package to group rows and return group summary (see the corresponding chapter in R for data science ). It makes it easy to compute the median count value per sample.

df_median <- 
  df_melt %>% 
  group_by(sample) %>% 
  summarise(median = median(value, na.rm = TRUE))

df_median
## # A tibble: 18 x 2
##    sample        median
##    <fct>          <dbl>
##  1 X1_28_Q_NT        31
##  2 X2_57_Q_NT        33
##  3 X4_80_Q_NT        20
##  4 X5_82_AM_NT       46
##  5 X6_24_AM_NT       39
##  6 X7_91_AM_NT       45
##  7 X8_47_Q_TNF       32
##  8 X9_23_Q_TNF       52
##  9 X10_99_Q_TNF      17
## 10 X11_100_Q_TNF     11
## 11 X12_54_AM_TNF     32
## 12 X13_54_AM_TNF     42
## 13 X15_94_AM_TNF     35
## 14 X42_2             43
## 15 X43_9             50
## 16 X44_18            63
## 17 X45_41            31
## 18 X46_50            54

Build a data frame with the median and mean count per gene across samples.

Note that normalization will also be carried out later during differential expression analysis. Here we show how we can use the dplyr package to normalize the data using the median. In passing, we also log transform the data.

median_mean <- mean(df_median$median)

df_melt <- 
  df_melt %>% 
  group_by(sample) %>% 
  mutate(value_norm = value/median(value, na.rm = TRUE)*median_mean,
         value_norm_log = log10(value_norm + 1))

ggplot(df_melt, aes(x=sample, y = value_norm_log, fill = sample)) + 
  theme(axis.text.x = element_text(angle = 90)) +
  geom_boxplot(alpha = 0.5)

We can build a normalized data frame using the reverse transform dcast:

data.norm.log <- dcast(df_melt, GeneID~sample, value.var = "value_norm_log")
rownames(data.norm.log) <- data.norm.log$GeneID
data.norm.log <- data.norm.log[-1]

Using metadata

Things get more interesting when we include metadata. Let’s map sample information to df_melt

idx_match <- match(df_melt$sample, rownames(sampleinfo))
df_melt$Status <- sampleinfo$Status[idx_match]
df_melt$Treatment <- sampleinfo$Treatment[idx_match]

Now we have access to two more variables, Treatment and Status, that we could use to form groups.

ggplot(df_melt, aes(x = Treatment, y = value_norm_log, fill = Treatment)) +
  geom_bar(alpha = 0.5, stat = "summary", fun.y = "median", position = "dodge") + 
  facet_wrap(~Status)

We can also focus on a particular set of GeneIDs and split plots using facet_grid. Let’s add an unpaired t-test on top of that.

library(ggsignif)
idx_select <- df_melt$GeneID %in% df_melt$GeneID[1]


ggplot(df_melt[idx_select, ], aes(x = Treatment, y = value_norm_log, fill = Treatment)) +
  geom_bar(alpha = 0.5, stat = "summary", fun.y = "median", position = "dodge") + 
  geom_point(position = position_jitter(width = 0.25, height = 0))+
  scale_y_continuous(expand = expand_scale(add=1)) +
  geom_signif(comparisons = list(1:2), 
              na.rm = TRUE, test = "t.test", 
              test.args = list("paired" = FALSE), 
              position = "identity") +
  facet_grid(GeneID~Status)

QC - Sample correlation matrix

To evaluate the quality of the data, we compare values across samples using the sample correlation matrix.

data.norm.log <- dcast(df_melt, GeneID ~ sample, value.var = "value_norm_log")
rownames(data.norm.log) <- data.norm.log$GeneID
data.norm.log <- data.norm.log[-1]
corr_mat = cor( data.norm.log )
heatmaply(corr_mat)

Let’look more in more details at two samples :

library(ggpointdensity)

ggplot(data.norm.log, aes_string(x = rownames(sampleinfo)[3], y = rownames(sampleinfo)[4])) + 
    geom_pointdensity(size = 0.3, alpha = 0.1 ) + 
    scale_color_viridis() + 
    annotate("segment", x =0, y =0, xend = 7, yend = 7, linetype = "dashed")

PCA

Principal component analyisis (PCA) is a great tool to see the overall “shape” of the data. It allows to identify which samples are similar to one another and which are very different. This can enable us to identify groups of samples that are similar and work out which variables make one group different from another. We use the prcomp function to run the PCA. Usually, features (here genes) are columns while observation (here samples) are rows so we’ll transpose the previously median normalized data.norm.log before running the PCA.

pca_res <- prcomp( t(data.norm.log) )
summary(pca_res)
## Importance of components:
##                            PC1     PC2      PC3      PC4      PC5      PC6
## Standard deviation     26.2502 20.9285 14.68125 13.84242 12.17558 10.74431
## Proportion of Variance  0.2698  0.1715  0.08438  0.07501  0.05804  0.04519
## Cumulative Proportion   0.2698  0.4412  0.52562  0.60063  0.65866  0.70386
##                             PC7     PC8     PC9    PC10   PC11    PC12    PC13
## Standard deviation     10.34392 9.52615 9.07145 8.71818 8.3053 7.92569 7.79867
## Proportion of Variance  0.04189 0.03553 0.03222 0.02976 0.0270 0.02459 0.02381
## Cumulative Proportion   0.74575 0.78127 0.81349 0.84324 0.8702 0.89484 0.91865
##                           PC14    PC15    PC16    PC17      PC18
## Standard deviation     7.59092 7.37775 7.12285 6.70914 6.095e-14
## Proportion of Variance 0.02256 0.02131 0.01986 0.01762 0.000e+00
## Cumulative Proportion  0.94121 0.96252 0.98238 1.00000 1.000e+00

You obtain 18 principal components, each one explaining a percentage of the total variation in the dataset (PC1 explains 27%, PC2 17% and so on). The relationship (correlation or anticorrelation, a.k.a loadings) between the initial variables and the principal components is in $rotation. Let’s see which genes have the greatest loadings

max_loading <- apply(pca_res$rotation, 1, max)
pca_max <- apply(pca_res$rotation, 1, which.max)

idx_select<- order(max_loading, decreasing = TRUE)[1:50]
annotation_row <- data.frame(pc = factor(pca_max[idx_select]))
rownames(annotation_row) <- rownames(data.norm.log)[idx_select]

Let’s look at the heatmap for these genes:

pheatmap(data.norm.log[idx_select, ],
         show_rownames=TRUE,
         annotation_col=sampleinfo[ , c("Treatment","Status")],
         annotation_row = annotation_row,
         labels_row = annot$Gene_name[ match(rownames(data.norm.log)[idx_select], rownames(annot))]
)

You can also select genes with the highest positive and negative loadings on a given PC:

idx_select<- c(order(pca_res$rotation[, 1], decreasing = TRUE)[1:20],
               order(pca_res$rotation[, 1], decreasing = FALSE)[1:20])
annotation_row <- data.frame(pc = c(rep("up", 20), rep("down", 20) ))
rownames(annotation_row) <- rownames(data.norm.log)[idx_select]

pheatmap(data.norm.log[idx_select, ],
         show_rownames=TRUE,
         annotation_col=sampleinfo[ , c("Treatment","Status")],
         annotation_row = annotation_row,
         labels_row = annot$Gene_name[ match(rownames(data.norm.log)[idx_select], rownames(annot))]
)

The values of each sample in terms of the principal components is in $x. Check taht the row names pca_res$x are the same as that of sampleinfo:

pca_res$x[ , 1:2]
##                      PC1         PC2
## X1_28_Q_NT    -17.393326  13.2705716
## X2_57_Q_NT     -5.804470  24.2066753
## X4_80_Q_NT     13.687838  -3.9781948
## X5_82_AM_NT    28.012442  -1.4174009
## X6_24_AM_NT   -17.055372  10.4438811
## X7_91_AM_NT    -5.219547  -3.5208188
## X8_47_Q_TNF    -6.839841  14.2835688
## X9_23_Q_TNF    12.275765  17.2709704
## X10_99_Q_TNF  -30.876275 -37.4686320
## X11_100_Q_TNF -21.090980 -61.7823048
## X12_54_AM_TNF  80.542757 -13.5139218
## X13_54_AM_TNF  22.919972  -6.5191274
## X15_94_AM_TNF  16.617856   0.5096783
## X42_2         -19.107976   8.8780438
## X43_9          -4.084852  10.2593812
## X44_18         -8.399046   9.4286515
## X45_41        -22.425481  17.8864512
## X46_50        -15.759463   1.7625274

We’ll use the ggplot2 package to plot the results of the PCA. See the R for data science book for an introduction to ggplot2. We first need to create a data frame from the pca results.

df_pca <- as.data.frame(pca_res$x)
df_pca$sample <- rownames(df_pca)
names(df_pca)
##  [1] "PC1"    "PC2"    "PC3"    "PC4"    "PC5"    "PC6"    "PC7"    "PC8"   
##  [9] "PC9"    "PC10"   "PC11"   "PC12"   "PC13"   "PC14"   "PC15"   "PC16"  
## [17] "PC17"   "PC18"   "sample"

We can now use the ggplot() function and choose to draw one point per sample.

library(ggplot2)
library(ggrepel)

pca_plot <- ggplot(df_pca, aes(x=PC1, y=PC2, color = sample, label=sample)) + 
  geom_point(show.legend = FALSE) + 
  geom_text_repel(show.legend = FALSE)
pca_plot

As an exercise, check that the row names pca_res$x are the same as that of sampleinfo (check it) and map metadata directly to the df_pca dataframe. Color points according to the sample Status or Treatment.

Correlations

At some point in the analysis, we might want to perform pairwise comparisons between genes. Let’s first create a data frame (note that we use the transpose of the countdata matrix so available variables are in names(df)).

df <- as.data.frame(t(data.norm.log))

Plotting the data for two different genes is a good way to identify functionnal relationships between them. Let’s pick two genes,

library(ggplot2)
xvar <- as.name(names(df)[1])
yvar <- as.name(names(df)[2])
p <- ggplot(df, aes_string(x=xvar, y=yvar)) + geom_point(alpha = 0.5)
p

and fit a linear model to the data and compute Pearson’s correlation coefficient.

lm_res <- lm(formula =`ENSG00000000005` ~ `ENSG00000000003`, data = df)
sm<-summary.lm(lm_res)
sqrt(sm$r.squared)
## [1] 0.858598

Now we want to repeat this for all pairs of genes (we use only a subset of genes) and build a correlation matrix

library(Hmisc)
idx_select<- order(max_loading, decreasing = TRUE)[1:50]
corr <- Hmisc::rcorr(t(data.norm.log[idx_select, ]))
pheatmap(corr$r, fontsize = 4)

We can explore correlations by constructing a data frame with the correlation corefficient, its associated p-value and the number of points used to compute the correlation

library(reshape2)
df_R_coeff <- melt(corr$r)
df_R_coeff$variable <- "R"
df_pval <- melt(corr$P)
df_pval$variable <- "P"
df_nval <- melt(corr$n)
df_nval$variable <- "n"

df_corr <- rbind(df_R_coeff, df_pval, df_nval)
df_corr <- dcast(df_corr, Var1 + Var2 ~ variable, mean)
df_corr$gene_name_1 <- annot$Gene_name[match(df_corr$Var1, rownames(annot))]
df_corr$gene_name_2 <- annot$Gene_name[match(df_corr$Var2, rownames(annot))]
df_corr <- df_corr[df_corr$Var1 != df_corr$Var2, ]

We can look at the whole correlation data frame using the interactive function datatable. It is convenient if you wish to search for certain variables or order results.

library(DT)
DT::datatable(df_corr)

Now we can filter and keep only the most correlated pairs

df_corr_filtered <- df_corr[!is.na(df_corr$P) & df_corr$n>10 & df_corr$P<0.05 & df_corr$R>0.9, ]

We use that restricted set to build a network.

library(igraph)

net <- igraph::graph.data.frame(df_corr_filtered[ , c("gene_name_1", "gene_name_2")], directed=FALSE)
net <- igraph::simplify(net)
clusters <- igraph::cluster_fast_greedy(as.undirected(net))
group = clusters$membership
names(group) <- clusters$names

And we render it in an interactive environment

library(networkD3)
net_d3 <- networkD3::igraph_to_networkD3(net, group = group)
p <- forceNetwork(Links = net_d3$links, Nodes = net_d3$nodes,
                  Source = 'source', Target = 'target',
                  fontFamily = "arial",
                  NodeID = 'name', Group = 'group',
                  colourScale = JS("d3.scaleOrdinal(d3.schemeCategory20);"),
                  charge = -10, opacity = 1,
                  linkColour = rgb(0.75, 0.75, 0.75),
                  fontSize = 12, bounded = TRUE, zoom=TRUE, opacityNoHover = 1)
p

Alternatively, we can save it as a text file and use another software such as cytoscape to manipulate this network.

dir.create("./output/")
write.table(df_corr_filtered, file = "./output/df_corr_filtered.txt", sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(clusters$membership, clusters$names), file = "./output/group.txt", sep = "\t", row.names = FALSE, quote = FALSE)

Exercise: add an extra column df$score <- 1:length(rownames(df)) that contains a score associated to each sample to our data frame. Which genes most strongly correlate with this score?

Differential expression

Creating a DESeqDataSet

We will use the DESeq2 package to conduct the core steps of the analysis. It should be already installed so we just need to load it in the RStudio session.

library(DESeq2)

We can use the help("DESeq2-package") command to browse the package documentation in Rstudio. We will use the DESeqDataSetFromMatrix function to build a DESeqDataSet object.

dds <- DESeqDataSetFromMatrix(countData = countdata,
                              colData = sampleinfo,
                              design = ~ Treatment + Status)

The design indicates how to model the samples, i.e how the counts for each gene depend on the variables in colData = sampleinfo. Here we want to measure the effect of the cell type and of the pregnancy status of the mice. Note that the two factor variables Treatment and Status should be columns of sampleinfo. Now the dds object contains count data along with the metadata and the experiment design. The count data is obtained using counts(dds) and the metadata is obtained using colData(dds).

Now that we have a DESeqDataSet object, we can analyse the data using the many tools available in the DESeq2 package.

Differential expression analyis

The standard differential expression analysis steps are wrapped into a single function, DESeq. It is then straightforward to perform differential expression analysis.

dds <- DESeq(dds)

We can see the comparisons carried out by DESeq using resultsNames().

resultsNames(dds)
## [1] "Intercept"                  "Treatment_TNF_vs_NT"       
## [3] "Status_healthy_vs_active"   "Status_quiescent_vs_active"

Results tables are generated using the function results, which extracts a results table with log2 fold changes, p values and adjusted p values. We specify which coomparison we want to extract using the name parameter. Here we choose to compare basal and luminal cell types. We select significant results with adjusted p-values (correction for multiple tests computed with the Benjamini–Hochberg procedure) below 0.01 using alpha.

res <- results(dds, alpha = 0.05, pAdjustMethod="BH", name="Status_quiescent_vs_active")
head(res)
## log2 fold change (MLE): Status quiescent vs active 
## Wald test p-value: Status quiescent vs active 
## DataFrame with 6 rows and 6 columns
##                          baseMean     log2FoldChange             lfcSE
##                         <numeric>          <numeric>         <numeric>
## ENSG00000129451  26.1748456742507  -0.59412036935381   0.2587337780261
## ENSG00000205420 0.982580594914044  -2.93453103817323  2.10952255838686
## ENSG00000169035  7.80444717174889 -0.888004457168178 0.488624734035383
## ENSG00000088340  685.378954759117  -1.04974882266852 0.330274957645292
## ENSG00000109511  5.28342772962168  -2.44407244320362  1.39680814205695
## ENSG00000071242  443.601832207341 -0.365417424016576 0.187158049050984
##                              stat             pvalue               padj
##                         <numeric>          <numeric>          <numeric>
## ENSG00000129451 -2.29626133041615 0.0216609437059134  0.237524902153958
## ENSG00000205420 -1.39108777315814  0.164198812074885                 NA
## ENSG00000169035 -1.81735470047629 0.0691628192437002  0.420362197594208
## ENSG00000088340 -3.17840877235365 0.0014808581113695 0.0483347219798127
## ENSG00000109511 -1.74975529538685                 NA                 NA
## ENSG00000071242 -1.95245369285204 0.0508843636655219  0.365056173443204

We can get a summary of these results using summary()

summary(res)
## 
## out of 31010 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 118, 0.38%
## LFC < 0 (down)     : 633, 2%
## outliers [1]       : 295, 0.95%
## low counts [2]     : 7162, 23%
## (mean count < 5)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Let’s select genes have an a adjusted p-value lower than 0.05 and a fold-change greater than 2^2:

enrich <- data.frame(ID=rownames(res), res@listData)

enrich  <- enrich  %>% 
  filter(padj < 0.05, log2FoldChange > 2) %>%
  arrange(desc(log2FoldChange))

enrich %>% head(10)
##                 ID   baseMean log2FoldChange    lfcSE     stat       pvalue
## 1  ENSG00000260075  14.495252       7.445833 1.699484 4.381232 1.180101e-05
## 2  ENSG00000156284 787.177133       5.468534 0.873294 6.261962 3.801635e-10
## 3  ENSG00000169271  30.965038       4.806102 1.382521 3.476333 5.083214e-04
## 4  ENSG00000167759   5.326531       4.628798 1.024176 4.519533 6.197609e-06
## 5  ENSG00000277010   5.556916       4.551041 1.200091 3.792246 1.492906e-04
## 6  ENSG00000091583   5.308916       4.535677 1.131813 4.007444 6.137949e-05
## 7  ENSG00000179796   5.892397       4.471098 1.124978 3.974386 7.056097e-05
## 8  ENSG00000178084   4.936564       4.370418 1.071060 4.080460 4.494669e-05
## 9  ENSG00000257612   9.577385       4.103931 1.249055 3.285629 1.017551e-03
## 10 ENSG00000080224   7.100705       4.057451 1.038739 3.906131 9.378583e-05
##            padj
## 1  1.561512e-03
## 2  4.712627e-07
## 3  2.380217e-02
## 4  9.930087e-04
## 5  9.686619e-03
## 6  5.200255e-03
## 7  5.730767e-03
## 8  4.174302e-03
## 9  3.798156e-02
## 10 6.946345e-03

Have a look at the most enriched gene

#get normalized data
d <- plotCounts(dds, gene=as.character(enrich$ID[1]), intgroup=c("Treatment","Status"), 
                returnData=TRUE)

library("ggplot2")
p <- ggplot(d, aes(x=Status, y=log10(count+1))) + 
  geom_point(position=position_jitter(w=0.1,h=0))  +
  facet_wrap(~Treatment)
p

We can also facet the plot according to Status. It could be interesting to see how the Treatment effect depends on Status. See section Interactions to address such questions.

ggplot(d, aes(x=Treatment, y=log10(count+1))) + 
  geom_point(position=position_jitter(w=0.1,h=0))  +
  facet_wrap(~Status)

We have used default parameters here. Data normalization and model fitting have been performed in the background by DESeq. If you want to perform these steps separately, have a look at the Additional Material section.

Interactions

Interaction terms allow to test, for example, if the log2 fold change attributable to a given condition is different based on another factor, for example if the Treatment effect differs across Status.

Interaction terms can be added to the design formula but a simpler approach to take into account interactions consists in: -combining the factors of interest into a single factor with all combinations of the original factors -changing the design to include just this factor, e.g. ~ group

dds.int <- dds
dds.int$group <- factor(paste(dds.int$Status, dds.int$Treatment, sep="."))
design(dds.int) <- ~ group
dds.int <- DESeq(dds.int)
resultsNames(dds.int)
## [1] "Intercept"                        "group_active.TNF_vs_active.NT"   
## [3] "group_healthy.NT_vs_active.NT"    "group_quiescent.NT_vs_active.NT" 
## [5] "group_quiescent.TNF_vs_active.NT"

Now we can compare active and quiescent patients that received the TNF treatment.

#res.int <- results(dds.int, contrast = c("group", "active.TNF", "quiescent.TNF"))
res.int <- results(dds.int, contrast = c("group", "active.NT", "quiescent.NT"))
res.int
## log2 fold change (MLE): group active.NT vs quiescent.NT 
## Wald test p-value: group active.NT vs quiescent.NT 
## DataFrame with 31010 rows and 6 columns
##                          baseMean    log2FoldChange             lfcSE
##                         <numeric>         <numeric>         <numeric>
## ENSG00000129451  26.1748456742507 0.937847364045463  0.37466711622884
## ENSG00000205420 0.982580594914044  4.47830019282281  5.14574796546978
## ENSG00000169035  7.80444717174889 0.546284275261462 0.740871487912225
## ENSG00000088340  685.378954759117 0.673352579982499 0.480607726904312
## ENSG00000109511  5.28342772962168 -2.37222965387234  2.10174490238311
## ...                           ...               ...               ...
## ENSG00000283557  1.36687788738831 0.699331388957992  1.56315664140943
## ENSG00000283580  2.75420568937431 0.574170190826705  1.10645358085756
## ENSG00000283597  2.08568607229969 0.688426276392395  1.47378995984917
## ENSG00000283638  1.17573897595086  1.91454074538724  2.28486824420557
## ENSG00000283667  2.41739220935345   1.5272315290754  1.17506471764344
##                              stat            pvalue              padj
##                         <numeric>         <numeric>         <numeric>
## ENSG00000129451  2.50314832399821 0.012309394506137 0.996750113873654
## ENSG00000205420 0.870291398427237 0.384141178398879                NA
## ENSG00000169035 0.737353622287302 0.460907331057749                NA
## ENSG00000088340   1.4010440163326 0.161200911051718 0.996750113873654
## ENSG00000109511 -1.12869532890625                NA                NA
## ...                           ...               ...               ...
## ENSG00000283557 0.447384075550762 0.654597771648423                NA
## ENSG00000283580  0.51892840401103 0.603810669441014                NA
## ENSG00000283597 0.467112882532358 0.640419117769962                NA
## ENSG00000283638 0.837921727102873  0.40207466528852                NA
## ENSG00000283667  1.29969992813521 0.193703835017983                NA

Alternatively, you can create another object keeping only the samples you want to compare :

samples <- rownames(sampleinfo)[sampleinfo$Status %in% c("quiescent", "active") & 
                                  sampleinfo$Treatment == "NT"]

dds.focus <- DESeqDataSetFromMatrix(countData = countdata[,samples],
                              colData = sampleinfo[rownames(sampleinfo) %in% samples, ],
                              design = ~ Status)

dds.focus <- DESeq(dds.focus)
resultsNames(dds.focus)
## [1] "Intercept"                  "Status_quiescent_vs_active"

Visualizing results

MA-plot

In DESeq2, the function plotMA shows the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples in the DESeqDataSet.

plotMA(res.int, ylim=c(-5, 5))

Volcano plot

Volcano plots allow to visualize p-values and fold-change. We add gene names for a set of selected genes

df <- as.data.frame(res.int@listData)
df$ID <- res.int@rownames
df$names <- annot$Gene_name[match(df$ID, rownames(annot))]
select <- abs(df$log2FoldChange) > 2 & df$padj < 0.05
ggplot(df, aes(x=log2FoldChange, y=-log10(padj), label = names)) + 
  geom_point(color = "gray") +
  geom_point(data = df[select, ], color = "red") +
  geom_text_repel(data = df[select, ], color = "black", 
                  min.segment.length = 0, segment.colour = "gray")

Annotation enrichment analysis

Now that we have identified a set of genes that are differentially expressed between basal and luminal conditions, we can use available gene annotations to figure out whether this set of genes is enriched in terms corresponding to specific biological processes, molecular functions or pathways. We’ll use the clusterProfiler package to perform this analysis on GO annotation terms corresponding to molecular functions (GO MF ontology).

library(clusterProfiler)

df <- as.data.frame(res.int@listData)
df$ID <- res.int@rownames
df$names <- annot$Gene_name[match(df$ID, rownames(annot))]
select <- abs(df$log2FoldChange) > 2 & df$padj < 0.05

enrichgo_result_over_MF = enrichGO( gene = df$ID[select], 
                                    keyType = "ENSEMBL",
                                    universe = df$ID,
                                    OrgDb = org.Hs.eg.db,
                                    ont = "MF",
                                    pvalueCutoff = 0.05,
                                    qvalueCutoff = 0.2,
                                    readable = TRUE)

Results are stored in the result slot

datatable( enrichgo_result_over_MF@result )

Check the clusterProfiler documentation if you wish to look at other ontologies.

For instance, for KEGG pathways :

df$EntrezID <- annot$ENTREZID[match(df$ID, rownames(annot))]
enrichgo_result_over_KEGG = enrichKEGG( gene = df$EntrezID[select], organism = "hsa",
                                    keyType = "kegg",
                                    universe = df$EntrezID,
                                    pvalueCutoff = 0.05,
                                    qvalueCutoff = 0.2)

GSEA

Complementary to the previous approach, GSEA (Gene Set Enrichment Analysis) consists in analyzing how sets of genes of interest (also called “genesets” or “signatures”) are distributed along some ranking (i.e. of ratios) between two conditions (i.e. luminal versus basal). It calculates an enrichment score and some statistics (p-value and FDR) for each geneset. Download the gene set file from here. Several sets can be dowloaded. We choose the Hallmark gene sets with “gene symbols” and save it in ./data. You should now have a file h.all.v7.1.symbols.gmt in the ./data folder. Let’s read it:

#read the geneset file
pathwaysH <- read.csv("./data/h.all.v7.1.symbols.gmt", sep="\t", header=F, stringsAsFactor=FALSE)

The gene set file needs to be in a specific format so we need to do a little formatting before using the fgsea() function:

#read the geneset file
# the format is different than when we do: load("human_H_v5p2.rdata")
# this causes a problem. 
genesets <- lapply( 1:nrow(pathwaysH), function(x){
  row <- pathwaysH[x, ]
  current_string =  paste( unlist( row[ 3:length( row)], use.names = FALSE), collapse="\t")
  current_string <- gsub("[\t]+$","", current_string)
  current_string <- strsplit(current_string, split="\t")[[1]]
  return(current_string)}
)
names(genesets) <- pathwaysH[[1]]

We also need to rank genes. We choose to rank them according to the log2 fold-change stored in our res object. But before that, we need to remove duplicate genes in order to keep only distinct gene symbols. We decide to keep the most regulated ones.

ranks <- data.frame(ID = rownames(res.int), log2FC = res.int$log2FoldChange)
idx_match <- match(rownames(res.int), rownames(annot))

ranks$symbol <- toupper(annot$Gene_name[idx_match])
ranks <- ranks[order(abs(ranks$log2FC), decreasing = TRUE), ]
ranks <- distinct(ranks, symbol, .keep_all = TRUE)
ranks <- ranks[order(ranks$log2FC, decreasing = TRUE), ]
stats <- ranks$log2FC
names(stats) <- ranks$symbol

Run GSEA

library(fgsea)
fgseaRes <- fgsea(pathways = genesets, stats = stats, minSize=15, maxSize = 500, nperm=1000)

Lets look at the top 10 most enriched gene sets:

topUp <- fgseaRes %>% 
  filter(NES > 0) %>% 
  arrange(padj, desc(NES)) %>%
  head(10)
  
topUp %>% dplyr::select(pathway, padj, NES)
##                                        pathway        padj      NES
##  1:           HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.005417705 2.377539
##  2:             HALLMARK_INFLAMMATORY_RESPONSE 0.005417705 2.219018
##  3:           HALLMARK_IL6_JAK_STAT3_SIGNALING 0.005417705 1.884133
##  4:         HALLMARK_INTERFERON_GAMMA_RESPONSE 0.005417705 1.877065
##  5:               HALLMARK_ALLOGRAFT_REJECTION 0.005417705 1.815710
##  6:                       HALLMARK_COAGULATION 0.005417705 1.812685
##  7: HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 0.005417705 1.805185
##  8:                        HALLMARK_COMPLEMENT 0.005417705 1.762153
##  9:               HALLMARK_IL2_STAT5_SIGNALING 0.005417705 1.600879
## 10:                 HALLMARK_KRAS_SIGNALING_UP 0.005417705 1.578338

and at the top 10 most depleted gene sets :

topDown <- fgseaRes %>% 
  filter(NES < 0) %>% 
  arrange(padj, desc(NES)) %>%
  head(10)

topDown %>% dplyr::select(pathway, padj, NES)
##                                pathway       padj        NES
##  1:       HALLMARK_PANCREAS_BETA_CELLS 0.03633721 -1.6192425
##  2: HALLMARK_OXIDATIVE_PHOSPHORYLATION 0.03633721 -2.5907251
##  3:            HALLMARK_G2M_CHECKPOINT 0.03706449 -1.9407357
##  4:               HALLMARK_E2F_TARGETS 0.03706449 -2.5292712
##  5:            HALLMARK_MYC_TARGETS_V1 0.03706449 -2.7097979
##  6:            HALLMARK_MYC_TARGETS_V2 0.04201681 -1.5720664
##  7:     HALLMARK_FATTY_ACID_METABOLISM 0.14595496 -1.2420703
##  8:                HALLMARK_DNA_REPAIR 0.21161985 -1.1645424
##  9:      HALLMARK_BILE_ACID_METABOLISM 0.55542698 -1.0669442
## 10:         HALLMARK_PROTEIN_SECRETION 0.76211214 -0.9792845

Let’see how the genes within a pathway are distributed along the ranked genes:

plotEnrichment(genesets[[topUp$pathway[1]]], stats)

We can also look at several pathways at once:

topPathways <- bind_rows(topUp, topDown) %>% 
  arrange(-ES)

plotGseaTable(genesets[topPathways$pathway], 
              stats, 
              fgseaRes, 
              gseaParam = 0.5)

Re-run the analysis using the GO MF gene set. What differentiate GSEA from our previous annotation enrichment analysis?

Additional Material

Normalization

Normalization is performed using estimateSizeFactors. This function first normalize rows by the geometric mean and returns the median value for each column.

dds.norm <-  estimateSizeFactors(dds)
sizeFactors(dds)
##    X1_28_Q_NT    X2_57_Q_NT    X4_80_Q_NT   X5_82_AM_NT   X6_24_AM_NT 
##     0.9840113     1.1777534     0.5163510     1.2505036     1.1530409 
##   X7_91_AM_NT   X8_47_Q_TNF   X9_23_Q_TNF  X10_99_Q_TNF X11_100_Q_TNF 
##     1.2456205     0.9368798     1.5653890     0.4751203     0.2477557 
## X12_54_AM_TNF X13_54_AM_TNF X15_94_AM_TNF         X42_2         X43_9 
##     1.1221783     1.0838932     0.9211399     1.3260912     1.5299650 
##        X44_18        X45_41        X46_50 
##     1.7712017     1.0289108     1.5250236

Normalized data can be retrieved using counts(dds.norm, normalized=TRUE). Let’s see what the normalized data looks like:

library(reshape2)
library(ggplot2)
df_norm  <- as.data.frame(counts(dds.norm, normalized=TRUE))
df_norm[["GeneID"]] <- rownames(df_norm)
df_melt_norm <- melt(df_norm , id.vars = "GeneID")
df_melt_norm <- dplyr::rename(df_melt_norm,  sample = variable)

ggplot(df_melt_norm, aes(x=sample, y = asinh(value), fill = sample)) + 
  theme(axis.text.x = element_text(angle = 90)) +
  geom_boxplot(alpha = 0.5)

ggplot(df_melt_norm, aes(x = asinh(value), fill = sample)) + 
  geom_density(alpha = 0.5)

Mean-variance relationship

As you can see from the following plot the relationship between variance and mean is not linear as would be the case for a Poisson distribution (red line).

## Computing mean and variance
df_norm <- as.data.frame( counts(dds.norm, normalized=TRUE) )
mean <- apply(df_norm, 1, mean, na.rm = TRUE)
variance <- apply(df_norm, 1, var, na.rm = TRUE)

df_norm$log10_mean <- log10(mean)
df_norm$log10_variance <- log10(variance)

p1<-ggplot(df_norm, aes(x=log10_mean, y=log10_variance)) + 
  geom_hex(bins=100) +
  scale_fill_viridis_c()+
  annotate("segment", x = -1, y = -1, xend = 6, yend=6, color = "red")

p1

Modelling read counts through a negative binomial

To perform diffential expression call DESeq will assume that, for each gene, the read counts are generated by a negative binomial distribution. One problem here will be to estimate, for each gene, the two parameters of the negative binomial distribution: mean and dispersion.

-The mean will be estimated from the observed normalized counts in both conditions. -The first step will be to compute a gene-wise dispersion. When the number of available samples is insufficient to obtain a reliable estimator of the variance for each gene, DESeq will apply a shrinkage strategy, which assumes that counts produced by genes with similar expression level (counts) have similar variance (note that this is a strong assumption). DESeq will regress the gene-wise dispersion onto the means of the normalized counts to obtain an estimate of the dispersion that will be subsequently used to build the binomial model for each gene. Performing estimation of dispersion parameter and display a diagnostic plot

## Performing estimation of dispersion parameter
dds.disp <- estimateDispersions(dds.norm)

## A diagnostic plot which
## shows the mean of normalized counts (x axis)
## and dispersion estimate for each genes
plotDispEsts(dds.disp)

####Performing differential expression

Now that a negative binomial model has been fitted for each gene, the nbinomWaldTest can be used to test for differential expression.

dds.wald <- nbinomWaldTest(dds.disp)
resultsNames(dds.wald)
## [1] "Intercept"                  "Treatment_TNF_vs_NT"       
## [3] "Status_healthy_vs_active"   "Status_quiescent_vs_active"

As before, we can see the result table. We keep only results with adjusted p-values (correction for multiple tests computed with the Benjamini–Hochberg procedure) below 0.01 using alpha.

res.wald <- results(dds.wald, alpha=0.01, pAdjustMethod="BH", name = "Treatment_TNF_vs_NT")
summary(res.wald)
## 
## out of 31010 with nonzero total read count
## adjusted p-value < 0.01
## LFC > 0 (up)       : 1, 0.0032%
## LFC < 0 (down)     : 0, 0%
## outliers [1]       : 295, 0.95%
## low counts [2]     : 0, 0%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
head(res.wald)
## log2 fold change (MLE): Treatment TNF vs NT 
## Wald test p-value: Treatment TNF vs NT 
## DataFrame with 6 rows and 6 columns
##                          baseMean     log2FoldChange             lfcSE
##                         <numeric>          <numeric>         <numeric>
## ENSG00000129451  26.1748456742507 -0.105548142588466 0.257652266152146
## ENSG00000205420 0.982580594914044  -1.27754692918506  2.11092452947951
## ENSG00000169035  7.80444717174889  0.171729485898849 0.485260537455315
## ENSG00000088340  685.378954759117  0.422868878052248 0.330407401532974
## ENSG00000109511  5.28342772962168   2.23932580388208  1.41019131148955
## ENSG00000071242  443.601832207341  0.125631343090971 0.187306069789873
##                               stat            pvalue              padj
##                          <numeric>         <numeric>         <numeric>
## ENSG00000129451 -0.409653461096046 0.682060173830069 0.961707675063548
## ENSG00000205420 -0.605207297250016 0.545041267808835  0.93380033275686
## ENSG00000169035  0.353891307130373 0.723420341930571  0.96948936181376
## ENSG00000088340   1.27984081497656 0.200601126167282 0.817639500400674
## ENSG00000109511   1.58795887170567                NA                NA
## ENSG00000071242  0.670727559613572 0.502394102717808 0.922651170851925